data replication
Finding Dori: Memorization in Text-to-Image Diffusion Models Is Not Local
Kowalczuk, Antoni, Hintersdorf, Dominik, Struppek, Lukas, Kersting, Kristian, Dziedzic, Adam, Boenisch, Franziska
Text-to-image diffusion models (DMs) have achieved remarkable success in image generation. However, concerns about data privacy and intellectual property remain due to their potential to inadvertently memorize and replicate training data. Recent mitigation efforts have focused on identifying and pruning weights responsible for triggering verbatim training data replication, based on the assumption that memorization can be localized. We challenge this assumption and demonstrate that, even after such pruning, small perturbations to the text embeddings of previously mitigated prompts can re-trigger data replication, revealing the fragility of such defenses. Our further analysis then provides multiple indications that memorization is indeed not inherently local: (1) replication triggers for memorized images are distributed throughout text embedding space; (2) embeddings yielding the same replicated image produce divergent model activations; and (3) different pruning methods identify inconsistent sets of memorization-related weights for the same image. Finally, we show that bypassing the locality assumption enables more robust mitigation through adversarial fine-tuning. These findings provide new insights into the nature of memorization in text-to-image DMs and inform the development of more reliable mitigations against DM memorization.
- Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Memory-Based Learning > Rote Learning (1.00)
- North America > United States > Michigan (0.04)
- Asia > Singapore (0.04)
- Asia > Nepal (0.04)
- Research Report > Experimental Study (0.93)
- Research Report > New Finding (0.67)
- North America > United States (0.14)
- Asia (0.14)
Towards Assessing Data Replication in Music Generation with Music Similarity Metrics on Raw Audio
Batlle-Roca, Roser, Liao, Wei-Hsiang, Serra, Xavier, Mitsufuji, Yuki, Gómez, Emilia
Recent advancements in music generation are raising multiple concerns about the implications of AI in creative music processes, current business models and impacts related to intellectual property management. A relevant discussion and related technical challenge is the potential replication and plagiarism of the training set in AI-generated music, which could lead to misuse of data and intellectual property rights violations. To tackle this issue, we present the Music Replication Assessment (MiRA) tool: a model-independent open evaluation method based on diverse audio music similarity metrics to assess data replication. We evaluate the ability of five metrics to identify exact replication by conducting a controlled replication experiment in different music genres using synthetic samples. Our results show that the proposed methodology can estimate exact data replication with a proportion higher than 10%. By introducing the MiRA tool, we intend to encourage the open evaluation of music-generative models by researchers, developers, and users concerning data replication, highlighting the importance of the ethical, social, legal, and economic consequences. Code and examples are available for reproducibility purposes.
- North America > United States > California > San Francisco County > San Francisco (0.14)
- Europe > Spain (0.04)
- Europe > Netherlands > Utrecht (0.04)
- Research Report > New Finding (1.00)
- Research Report > Experimental Study (0.88)
- Media > Music (1.00)
- Leisure & Entertainment (1.00)
- Law (1.00)
From Trojan Horses to Castle Walls: Unveiling Bilateral Backdoor Effects in Diffusion Models
Pan, Zhuoshi, Yao, Yuguang, Liu, Gaowen, Shen, Bingquan, Zhao, H. Vicky, Kompella, Ramana Rao, Liu, Sijia
While state-of-the-art diffusion models (DMs) excel in image generation, concerns regarding their security persist. Earlier research highlighted DMs' vulnerability to backdoor attacks, but these studies placed stricter requirements than conventional methods like 'BadNets' in image classification. This is because the former necessitate modifications to the diffusion sampling and training procedures. Unlike prior work, we investigate whether generating backdoor attacks in DMs can be as simple as BadNets, i.e., by only contaminating the training dataset without tampering with the original diffusion process. In this more realistic backdoor setting, we uncover bilateral backdoor effects that not only serve an adversarial purpose (compromising the functionality of DMs) but also offer a defensive advantage (which can be leveraged for backdoor defense). Specifically, we find that a BadNets-like backdoor attack remains effective in DMs for producing incorrect images (misaligned with the intended text conditions), thereby yielding incorrect predictions when DMs are used as classifiers. Meanwhile, backdoored DMs exhibit an increased ratio of backdoor triggers, a phenomenon we refer to as 'trigger amplification', among the generated images. We show that this latter insight can be used to enhance the detection of backdoor-poisoned training data. Even under a low backdoor poisoning ratio, studying the backdoor effects of DMs is also valuable for designing anti-backdoor image classifiers. Last but not least, we establish a meaningful linkage between backdoor attacks and the phenomenon of data replication by exploring DMs' inherent data memorization tendencies. The code for our work is available at https://github.com/OPTML-Group/BiBadDiff.
- North America > United States > Michigan (0.04)
- Asia > Singapore (0.04)
- Asia > Nepal (0.04)
Testing for the Markov Property in Time Series via Deep Conditional Generative Learning
Zhou, Yunzhe, Shi, Chengchun, Li, Lexin, Yao, Qiwei
The Markov property is widely imposed in the analysis of time series data. Correspondingly, testing the Markov property, and relatedly, inferring the order of a Markov model, are of paramount importance. In this article, we propose a nonparametric test for the Markov property in high-dimensional time series via deep conditional generative learning. We also apply the test sequentially to determine the order of the Markov model. We show that the test controls the type-I error asymptotically and has power approaching one. Our proposal makes novel contributions in several ways. We utilize and extend state-of-the-art deep generative learning to estimate the conditional density functions, and establish a sharp upper bound on the approximation error of the estimators. We derive a doubly robust test statistic, which employs a nonparametric estimation but achieves a parametric convergence rate. We further adopt sample splitting and cross-fitting to minimize the conditions required to ensure the consistency of the test. We demonstrate the efficacy of the test through both simulations and three data applications.
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
- North America > United States > California > Alameda County > Berkeley (0.04)
- Europe > United Kingdom > England > Greater London > London (0.04)
- Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (0.68)
What is the Data Architecture we Need?
In the new era of Big Data and Data Science, it is vitally important for an enterprise to have a centralized data architecture aligned with business processes, one that scales with business growth and evolves with technological advancements. A successful data architecture provides clarity about every aspect of the data, which enables data scientists to work efficiently with trustworthy data and to solve complex business problems. It also prepares an organization to quickly take advantage of new business opportunities by leveraging emerging technologies, and it improves operational efficiency by managing complex data and information delivery throughout the enterprise. Compared with information architecture, system architecture, and software architecture, data architecture is relatively new. The role of the Data Architect has also been nebulous and has fallen on the shoulders of senior business analysts, ETL developers, and data scientists.
How to transform your SAP S/4HANA system into a machine learning (ML) powerhouse by exposing your tailor-made ML models
In this blog, we explain how you can use your SAP S/4HANA system to execute machine learning tasks. We sketch a step-by-step guide on how to process the data and execute the relevant database procedure, and we discuss potential ways to expose the results. The advantages of this approach are that no data replication is needed and that the use of database procedures offers great performance. Note that this approach requires custom development on the S/4HANA system. You may wonder why you would want to do machine learning on an S/4HANA system.
Machine learning is driving demand for data replication
Data for the enterprise is now a currency of its own, yet many companies and institutions are still trying to navigate moving large volumes of data from on-premise to the cloud in an effort to capitalize on the value of data stored in many locations. "I think longer-term the economic advantage of using cloud environments are undeniable. The cost advantages of hosting information in the cloud, the benefits that come from the scalability of those environments is far surpassing capabilities that organizations can invest in themselves or their own data centers," said Paul Scott-Murphy, vice president of product management, big data/cloud, at WANdisco Inc. During the Google Cloud Next event, Scott-Murphy spoke with Stu Miniman (@stu), host of theCUBE, SiliconANGLE Media's mobile live streaming studio, at SiliconANGLE's Palo Alto, CA, studio to discuss the trends WANdisco is seeing with its customers, as well as news from Google Cloud Next. WANdisco's enterprise and institutional customers all face a similar problem: the availability of data, combined with where it is stored, makes it difficult for them to access it and derive any benefit from it.
- Information Technology > Cloud Computing (1.00)
- Information Technology > Artificial Intelligence > Machine Learning (0.44)
- Information Technology > Data Science > Data Mining > Big Data (0.37)